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Classical penalized likelihood regression problems deal with the 
case that the independent variables data are known exactly. In prac- 
tice, however, it is common to observe data with incomplete covariate 
information. We are concerned with a fundamentally important case 
where some of the observations do not represent the exact covariate 
information, but only a probability distribution. In this case, the max- 
imum penalized likelihood method can be still applied to estimating 
the regression function. We first show that the maximum penalized 
likelihood estimate exists under a mild condition. In the computation, 
we propose a dimension reduction technique to minimize the penal- 
ized likelihood and derive a GACV (Generalized Approximate Cross 
Validation) to choose the smoothing parameter. Our methods are ex- 
tended to handle more complicated incomplete data problems, such 
as, covariate measurement error and partially missing covariates. 
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1. Introduction. 

1.1. Penalized likelihood regression in reproducing kernel Hilbert spaces. 
We are concerned with non or semi parametric regression for data from 
a non-Gaussian exponential family. Suppose that we have n independent 
observations (yi,Xi),i = l,...,n, where each yi denotes the response and 
each Xi denotes the covariate information. The goal is to fit a probability 
mechanism, assuming that the conditional distribution of yi given Xi has a 
density in the exponential family with the form 

(1.1) p(yi\xi, f) = exp{(y; • f(xi) - b(f(xi)))/a(4>) + c(yi, 4>)} 

where &(•) and c(-) are given functions with b(-) strictly convex, (j> is the 
scale parameter and / is the regression function to be estimated. We assume 
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throughout this paper that <j> is known, as, for example, Binomial data and 
Poisson data. In this case, (1.1) can be simplified by 



Note that the methods of this paper can also be extended to the situation 
when (j) is unknown, but may be more computationally complicated. 

The regression function / will be estimated non or semi parametrically 
in some reproducing kernel Hilbert space (RKHS) T~L by minimizing the 
penalized likelihood 



where the penalty J(-) is a norm or semi-norm in T~L with finite dimensional 
null space T-Lq = {/ & % \ J(f) = 0} and A is the smoothing parameter 
which balances the tradeoff between model fitting and smoothness. In this 
case, if the null space H.q satisfies some condition, saying that I\(f) has a 
unique minimizer in T-Lo, then the minimizer of I\(f) in T~L exists in a known 
n-dimensional subspace spanned by Ho and functions of the reproducing ker- 
nel. See, for example, Kimeldorf and Wahba (1971) [25], O'Sullivan, Yandell 
and Raynor (1983) [32], Wahba (1990) [35] and Xiang and Wahba (1996) [36]. 
This model building technique, known as penalized likelihood regression 
with RKHS penalty, allows for more flexibility than parametric regression 
models. We will not review the general literature, other than to note two 
books and references therein. Wahba (1990) [35] offers a general introduc- 
tion of spline models. Gu (2002) [17] comprehensively reviews the smoothing 
spline analysis of variance (SS-ANOVA), an important implementation of 
penalized likelihood regression in multivariate function estimation. 

1.2. Randomized covariate data and related problems. In this paper, the 
issue we are concerned about is the situation where components of Xi are not 
observable but only known to have come from a particular probability dis- 
tribution. This concept of randomized covariate, without the requirement of 
any actual measure of Xj, is more flexible than the common sense of covariate 
measurement error. In this natural likelihood-based approach is to 

treat X% S clS latent variables and minimize a randomized version of penalized 
likelihood that integrates Xj's out of the likelihood. This approach, however, 
typically leads to a non-convex and infinite dimensional optimization prob- 
lem in RKHS. Therefore we shall first prove that the randomized penalized 
likelihood is minimizable. This is the subject of Section 2. Afterwards, two 



(1.2) 



p{Vi\Xh f) = exp{i/i • f(xi) - b(f(xi)) + c(yi)}. 



(1.3) 



1 A 

us) = — y>gp(2/iizi,/) + -zju) 
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computational issues will be addressed in Section 3: (1) how to numerically 
compute an estimator and (2) how to select the smoothing parameter. 

Randomized covariate data is a basic version of incomplete data. Our 
methods can be extended to other incomplete data problems. For example, 
in the survey or medical research, it is common to obtain data where the 
covariates are measured with error. More specifically, x\ is not directly ob- 
served but instead xf r = Xi + U{ is observed, where U{,i = 1, ...,n are iid 
random perturbations. Fan and Truong (1993) [12] regarded this measure- 
ment error problem in the context of nonparametric regression, using the 
methods based on kernel deconvolution. Their technique was later studied 
and extended by, for example, Ioannides and Alevizo (1997) [21], Schennach 
(2004) [31], Carroll, Ruppert and Stefanski (2006) [6] and Delaigle, Fan and 
Carroll (2009) [11]. More recently, penalized likelihood regression have been 
considered in the measurement error literature. Carroll, Maca and Ruppert 
(1999)[5] suggested to use the SIMEX method (Cook and Stefanski, 1994[9]) 
to build nonparametric regression models including both kernel regression 
and penalized likelihood regression. Berry, Carroll and Ruppert (2001) [2] de- 
scribed Bayesian approaches for smoothing splines and regression P-splines. 
Cardot, Crambes, Kneip and Sarda (2007) [4] used the total least square 
method (Van Huffel and Vandewalle, 1991 [33]) to compute a smoothing 
spline estimator from noisy covariates. These papers mainly discussed the 
situation of Gaussian responses but very little literature concerns other re- 
sponses. As a sequel to these works, in this paper, we treat measurement 
error as a special case of randomized covariates, because each X{ can be 
viewed as a random variable (vector) distributed as xf T — Ui. Therefore the 
methodology of randomized penalized likelihood estimate can be employed. 

We will as well be able to make another modest extension to treat the 
important situation where some components of some Xj's are completely 
missing. In this case, we may write xi = (x° bs , x™" ls ) , where x° bs and x™ s 
denote the observed and the missing components. It is well-known (Lit- 
tle and Rubin, 2002 [29]) that a complete case analysis that deletes the 
cases with missing information often leads to bias or inefficient estimates. 
Various methods for missing covariate data have been developed in the 
context of parametric regression models, but to date few methods have 
been proposed for nonparametric penalized likelihood regression in RKHS. 
For parametric regression, one popular approach is the method of weights 
initially proposed by Ibrahim (1990) [18]. His suggestion is to assume the 
£j's to be independent observations from a marginal distribution depend- 
ing on some parameters and to maximize the joint distribution of (yi,Xi) 
by the expectation-maximization (EM) algorithm. Discussions and exten- 
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sions of this method appear in Ibrahim, Lipsitz and Chen (1999) [19], Hor- 
ton and Laird (1999) [22], Huang, Chen and Ibrahim (2005) [24], Ibrahim, 
Chen, Lipsitz and Herring (2005) [20], Horton and Kleinman (2007) [23], Chen 
and Ibrahim (2006) [7], Chen, Zeng and Ibrahim (2007) [8] and elsewhere. 
Ibrahim's method can also be employed to build nonparametric regression 
models. Actually, in the framework of Ibrahim's method, the missing com- 
ponents x™ ls can be viewed as a random vector depending on the observed 
components x° bs and the covariate marginal distribution. Therefore in this 
paper, missing covariate data is treated as a special case of randomized 
covariate data, and thus our methods can be extended. 

1.3. Outline of paper. The rest of the paper is organized as follows. In 
Section 2, we prove the existence of the randomized covariate penalized like- 
lihood estimation in the general smoothing spline set-up. Computational 
techniques are presented in Section 3. Sections 4 and 5 extend our methods 
to the problem of covariate measurement error. Sections 6 and 7 describe pe- 
nalized likelihood regression with missing covariate data. Section 8 provides 
some numerical results. We conclude our paper in Section 9. 

2. Randomized covariate penalized likelihood estimation (the- 
ory). Consider the general smoothing spline set-up, where x is allowed to 
be from some arbitrary index set & on which an RHKS can be defined. 
Randomized covariate data is defined in the way that we "observe" for each 
subject i a probability space (X. L ,Fi, Pi), rather than a realization of Xj, where 
Xi Q & denotes the domain of Xi, T\ is a a— algebra and Pi is a probability 
measure over (Xi, F). 

In this case, each be treated as a latent random variable. Thus, 

given a regression function /, the distribution of [yi\f] has a density 



Note that, throughout this paper, we use the labels [A\B] and p(A\B) to 
denote the conditional distribution of A given B and the density function 
for this distribution. According to (2.1), the penalized likelihood estimate of 
/ is the minimizer of 



where R denotes the "randomness" of the covariates and / is restricted on 
the Borel measurable subset 



(2.1) 




(2.2) 



i?(/) = --y>g / P (y i \x i ,f)dP i + ^J(f) 
n JXi 2 



(2.3) %b = {/ £ % : / is Borel measurable on (Xi,Fi), i = 1, n} 
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in which the Lebesgue integrals in (2.2) can be defined. It can be shown that 
Hb is a subspace of H. 

PROPOSITION 2.1. H B is a subspace ofU. 

Proof See Appendix A. □ 

This methodology can be referred to as randomized covariate pe- 
nalized likelihood estimation or RC-PLE. Note that RC-PLE includes 
the classical penalized likelihood regression where observed exactly. 

Actually, I\{f) equals I\(f) if every [Xi,Fi,Pi) stands for a single point 
probability. 

However, computation of RC-PLE is extremely difficult. Firstly, since 
each p{yi\Xi,f) is log-concave as a function of /, I\(f) is in general not 
convex due to the integrals. Secondly, if at least one (Xi, J-i, Pi) has infinite 
support, then there is no finite dimensional subspace in which f\ is known 
a priori to lie, as can be concluded from the arguments in Kimeldorf and 
Wahba (1971) [25]. Therefore, we shall first prove that I^if) is minimizable 
and hence the phrase "penalized likelihood estimate" is meaningful. Com- 
putational techniques will be described in Section 3. 

Recall that for the classical penalized likelihood regression, the unique so- 
lution in the null space is sufficient to ensure the existence of the penalized 
likelihood estimate. In the case of randomized covariate data, we extend this 
condition as follows: 

ASSUMPTION A.l (Null space condition). There exist exactly observed 
subjects (yfc^xfcj, (y k2 ,x k2 ), {yk s ,x ks ) such that X)i=i ^SP^il 3 ^ /) has 
a unique maximizer in %o- 

Now we state our main theorem. 

THEOREM 2.2. Under A.l, 3f x e U B such thatl^(fx) = inf /eHs Jf (/). 

Theorem 2.2 guarantees the existence of the RC-PLE estimate, which 
justifies the title of the paper. In particular, if the null space of the penalty 
functional J(-) contains only constants, then A.l can be ignored. In this 
case, the penalized likelihood estimate always exists. 

Our proof of the theorem is based on lower-semicontinuity in the weak 
topology. We first recall some definitions. 
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DEFINITION 1. A sequence {/fcjfcgN in a Hilbert space ft is said to con- 
verge weakly to / if (fk,g) — > (f,g) for all g £ ft. Here (•,•) denotes the 
inner product of ft. 

DEFINITION 2. Let ft be a Hilbert space, a functional 7 : ft ->■ R 
is (weakly) sequentially lower semicontinuous at / £ ft if 7(/) < 
liminf 7(// c ) for any sequence {/fcjfceN that (weakly) converges to /. 

DEFINITION 3. Let ft be a Hilbert space, a functional 7 : ft -> R is 
positively coercive if — * +00 implies "f(f) = +00. Here || • ||% de- 
notes the norm of ft . 

Theorem 2.2 can be shown by combining Proposition 2.3 and Lemmas 
2.4-2.6 below. Note that Proposition 2.3 is obtained from Theorem 7.3.7 
in Kurdila and Zabarankin (2005) [27], Page 217. The proofs of lemmas are 
given in Appendix A. 

PROPOSITION 2.3. Let H be a Hilbert space. Suppose that 7 : M C ft ->■ 
R is positively coercive and weakly sequentially lower semicontinuous over 
the closed and convex set A4, then 3fo £ A4 such that 7(/o) = inf /gx 7(/)- 

LEMMA 2.4. Under A.l, the penalized likelihood I?{f) is positively coer- 
cive over %b- 

LEMMA 2.5. The functional log J x p(yi\xi, f)dPi : Hb — > R is weakly 
sequentially continuous. 

LEMMA 2.6. The penalty functional J(-) is weakly sequentially lower 
semi- continuous. 

Proof of Theorem 2.2. Consider the functional 1^ : Hb ^ ft — > R- 

Theorem 2.2 follows from Proposition 2.2, Lemma 2.4-2.6 and Proposition 
2.3 □. 

3. Randomized covariate penalized likelihood estimation (com- 
putation). In the preceding section, we theoretically extended penalized 
likelihood regression in RKHS to randomized covariate data, where / was 
restricted on the Borel measurable subspace ft 5. In practical applications, 
however, we often face the case that all functions in the RKHS are Borel 
measurable. In this case, we no longer need the restriction mentioned in 
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(2.3). Thus, we would like to proceed our discussion under the following 
condition 

ASSUMPTION A.2. Consider the Borel-cr field of U (generated by the 
open sets). Mapping: 

ST -> U 

x ^ K x (-)=K(-,x) 

is Borel measurable for all (^,7^), i = l,...n. Here K{-,-) denotes the re- 
producing kernel of H. 

Under A. 2, by Theorem 90 of Berlinet and Thomas- Agnan (2004) [1], Page 
195, every function in % is Borel measurable. It can be verified that if the 
domain 3? C M d and every T% is a Borel cr-field, then A. 2 is satisfied with 

• Every continuous kernel; 

• Kernels built from tensor sums or products of continuous kernels; 

• Any radial basis kernel K(x, z) = r(\\x — z\\d) such that r(-) is contin- 
uous at 0. Here || • ||^ denotes the usual Euclidian norm. 

3.1. Quadrature penalized likelihood estimates. As previously discussed, 
there is in general no finite dimensional subspace in which the RC-PLE 
estimate fx is known a priori to lie, so direct computation is not attractive. 
In this case we shall find a finite dimensional approximating subspace and 
compute an estimator in this space. We consider the following penalized 
likelihood: 

n mi . 

(3.1) /f' n (/) = --^log^Tr^N;,/) + 2 J (f) 

i=l 3=1 

where Z = {zn, z lmi , z 21 , ...,z nmn } with z i3 - G 2? and n = {irn, ...,ir lmi , 
7T2i, ^nm n } with iTij > 0. In words, when we evaluate the integrals on the 
right hand side of (2.2), each {X^T^ Pi) is replaced by a discrete probability 
distribution defined over {zn, Zi2, Zi mi } with probability mass function 
P(xi = Zij) = 7Tjj , j = 1, ...,m». Thus Zij, 1 < j < m» and mj, 1 < j < m» 
are referred to as nodes and weights of a quadrature rule for probability 
measure P{. 

In (3.1), / is only evaluated on a finite number of quadrature nodes. Under 
A.l, it can be seen from Theorem 2.2 and the arguments in Kimeldorf and 

Z IT 

Wahba (1971) [25] that the minimizer of ' (/) in 7~L is in a finite dimen- 

z n 

sional subspace Hz spanned by T~Lq and {K(-,Z{j) : z^ £ Z}. Thus, I x ' (/) 
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can be formulated as a parametric penalized likelihood. Green (1990) [16] 
gave a general discussion on the use of the EM algorithm for parametric 
penalized likelihood estimation with incomplete data. His method can be 
extended to minimize h ' (/). It can be shown that the E-step at iteration 
t+1 has the form of 

(3.2) Q(f\f {t) ) = -EE4 } -logP^N,/) " ^J(f) 

n i=i j=i 

where is estimated at iteration t and the weight 

(3.3) w {t) - "M^MlI^l 



indicates the conditional probability of [zij\yi, f^']- The M-step updates / 
by maximizing Q(f\f^) in %. This is straightforward because —Q(f\f^) 
is seen to be a weighted complete data penalized likelihood. 

When the EM algorithm converges, we will obtain an estimator f\ which 
approximates the RC-PLE estimate f\. Note that f\ can be interpreted as 
a minimizer of I\{f) when the integrals are approximated by quadrature 
rules. Hence, this computational technique is referred to as quadrature 
penalized likelihood estimation or QPLE. The motivation behind this 
approach is that an efficient quadrature rule often requires only a few nodes 
for a good approximation to the integral. This convenient property eases the 
computation burden at each M-step. 

3.2. Construction of quadrature rules. Construction of quadrature rules 
is a practical issue. In order to derive more applicable results, we further 
assume that each xi = (xn, ...,Xid) T is a random vector, i.e., ^ C M. d . 

3.2.1. Univariate quadrature rules. Suppose that xi is univariate (i.e., 
d = 1). In this case, if Xi is a categorical random variable or exactly ob- 
served, then (Xi,Pi) itself can be used as a quadrature rule. Otherwise, if 
continuous random variable, we will construct a Gaussian quadra- 
ture rule. Development of computational methods and routines of Gaussian 
quadrature integration formulae for probability measures is a mathematical 
research topic. We will not survey the general literature here, other than 
to say that the methods considered in this paper can be obtained from, 
Golub and Welsch (1969) [15], Fernandes and Atchley (2006) [13], Bosserhoff 
(2008) [3] and Rahman (2009) [30]. Though a k-node Gaussian quadrature 
rule typically requires the first 2k moments of the measure Pi to be finite, 
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this convention can be satisfied by most popular probability distributions 
including normal, uniform, exponential, gamma, beta and others. Besides 
Gaussian quadrature rules, if Xi has a density with respect to the Lebesgue 
measure, we also consider a quadrature rule with equally-spaced points. 
More specifically, suppose that x% ranges over [a,b], then we take equally- 
spaced points in [a, b] as quadrature nodes while the quadrature weights 
are proportional to the density evaluated at the chosen nodes. Note that if 
a = — oo (or b = +00), we set a = fii — 3<7j (or b = /ij + 3<7j) where m 
and <7j denote the first and second moments of Pi. We refer to this simple 
quadrature rule as the grid quadrature rule. 

3.2.2. Multivariate quadrature rules. Suppose that X{ = (xn, ...,Xid) T is 
a multivariate random vector (i.e., d > 1). In this quadrature rule 

can be generated recursively with one-dimensional conditional quadrature 
rules. The algorithm is summarized as follows: 

1. Set s = 1. Compute the marginal distribution of xn and generate a 
quadrature rule for Xn by using the method for univariate random 
variables. 

2. Let {z[ , Zm]} and {n[ s \ itm s } be the quadrature rule generated 
for the marginal distribution of (xa, Xi s ) T . For each z- , 1 < j < 
m s , compute the one-dimensional conditional distribution of 

[xi( s +i)\{%a, -,x is ) T = z^ s) ] 

Then generate a quadrature rule for this distribution, denoted by 
{z* v z* n .} and {tt*!, n* n .}. Then {((z^f, z* r f , 1 < r < n jt 1 < 

j < m s } and -^j r , 1 < r < rij , 1 < j < m s } compose a quadrature 
rule for the marginal distribution of (xn, Xi s , Xj( s+1 )) T . 

3. Set s = s + 1. Repeat step 2 until s = d. 

The order that x^ 's jump into the algorithm is not important. One may re- 
arrange the order to simplify the computation of the quadrature rules. From 
our experience, a quadrature rule with 7 to 12 nodes for each component 
of Xi usually yields a very good approximation. In this case, the above EM 
algorithm usually converges very rapidly. 

3.3. Choice of the smoothing parameter. 

3.3.1. The comparative KL distance and leaving- out- one- subject CV. So 
far the smoothing parameter A, is assumed to be fixed. Choice of A is a 
key problem in the penalized likelihood regression. For non-Gaussian data, 
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Kullback-Leibler (KL) distance is commonly used as the risk function for 
the estimator f\ 



\h 



where /* denotes the true regression function and the expectation is taken 
over y® ~ p(y\f*) independent of yi. In order to estimate KL(/*, f\), Xiang 
and Wahba (1996) [36] proposed generalized approximate cross vali- 
dation (GACV) beginning with a leaving-out-one argument to choose the 
smoothing parameter, which works well for Bernoulli data. Lin, Wahba, Xi- 
ang, Gao, Klein and Klein (2000) [28] derived a randomized version of GACV 
(ranGACV) which is more computationally friendly for large data sets. In 
this section we obtain a convenient form of leaving-out-one-subject CV for 
randomized covariate data and extend GACV and randomized GACV to 
randomized covariate data in subsequent sections. 

In the situation when each observed covariate is actually a probability 
space (Xi, Fi,Pi), [yf\f] has a density of 

(3-5) p(Vi\f)= [ p(v?\xi,f)dPi. 

Following (3.4) and leaving out the quantities which do not depend on A, 
the comparative KL (CKL) distance can be written as 

(3.6) CKL(A) = { log J x ex P {y°/ A (x i ) - &(/*(*<))} dP, t 
To simplify the notation, let's denote 

(3.7) L(y, f, Pi) = log / exp {yf( Xi ) - b(f( Xi ))} dP { 

the log-likelihood function for randomized covariate data. Using first order 
Taylor expansion to expand L at the point yi, we have that 

dL 

(3.8) L{ylh,Pi) « L(yi,f x ,Pj + (y° - Vi ) — (y u f x , Pi). 

dy 

Direct calculation yields 

dL j p ^ f Xi f\(xi) exp {yif\(xi) - b(f x (xi))} dP-i 
dy ' J Xi exp{yif x (xi) - b(f x (xi))}dPi 

(3-9) = E Xi]yiJ J x ( Xi ). 
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Plugging (3.8) and (3.9) into (3.6), we have that 

1 n 

CKL(A) « OBS(X) + -Y.EyO^iyi-y^E^jJxixi) 

i=i 

1 ™ 

(3.10) = OBS(X) + -Y / (yi-^)E Xi \y i>h fx{x i ) 



n 
i=i 



where fi* = E y o\j*y® is the true mean response and 
1 n r 

(3.11) OBS(A) = — Vlog / exp{y i h(x i )-b(f x (x i ))}dP i 

is the observed log-likelihood. Denote /{ ^ the leaving-out-one estimator, 
i.e., the minimizer of I^(f) with the ith subject omitted. Since E x .\ y .j x f\(xi) 
is the posterior mean estimate of f*(xi), following Xiang and Wahba (1996) [36] , 
we may replace ^*E Xi \ yi jJ x (xi) by yi E x .\ yi jl-nf{ (xi) and define the leaving- 
out-one-subject cross validation (CV) by 

1 n 

(3.12) CV(A) = OBS(A) + -J2yi(E Xi]yi , f Jx(x i )-E , M /r 1 (xi)). 



It can be seen that (3.10) and (3.12) generalize the complete data CKL 
and CV formulas proposed in Xiang and Wahba (1996) [36]. If fx denotes 
the QPLE estimate, then we may further approximate (3.12) by quadrature 
rules. More specifically, OBS(A) can be evaluated by 

-, n mi 

(3.13) OBS(A) = — ^log^7ryexp{ yi / A ( % ) - 6(/a(^))} 

n i=l j=l 

where Zy's and 7Tjj's represent nodes and weights of the quadrature rules 
given in the preceding section. Define the weight functions 

/oi^ i\ Try exp {t/jTj - b(rj)} . 

(3.14 Wij ( T ) = ^> ^> J- — j = 1, ...,rrii 

Lfc ^ik exp {yiT k - b(r k )\ 

where r = (n, ...,T m J T is an arbitrary vector of length m,j. Let us use the 
notations 

(3.15) fx i = (f x (z il ),...,fx(z im% )) T 

(3-16) fT ] = (fV ] (za),...jY ] (^)f. 
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Then (3.9) yields 

rrij rrii 

(3.17) E Xi \ yi jJ X {xi) « ^2wij(f\i)f\(zij) = ^W^ijfxiZij) 

j=l 3=1 

rrii rrii 

(3.18) E^jwfMixi) « E^(/a^)^(%) = E^/Jf^) 

j'=i i=i 

where WA,ij = Wij(fxi) and w^^- = equal the weights at the final 

iteration of the EM algorithm, respectively, when f\ and f[ were com- 
puted. Therefore a more convenient version of CV can be obtained as 

-i n rrii 

(3.19) CV(A) « OBS(A) + - Vi EK,ii/A(2ii) - 

71 i=l 3=1 

3.3.2. Parametric formulation of /_f ,n . Based on (3.19) and by using 
several first order Taylor expansions, a generalized approximate cross val- 
idation (GACV) can be derived for randomized covariate data. Before we 
proceed, we would like to establish some notations. 

As we previously discussed, I^' n (f) can be formulate parametrically as 

-, n rrii » 

(3.20) /fW) = —J2^gJ2^p(yi\h) + o/ Ts */ 

n i=i j=i 

where / = (/n, f lmi , / 2 i, fnm n ) T denotes the vector of / evaluated at 
{Zij,l < i < n, 1 < j < rrii}, y = (y/ ...,yJ) T with y { = (y h ...,yi) T be- 
ing rrii replicates of yi and T,\ is the positive semi-definite matrix satisfying 
AJ(/) = / T E A /. Note that minimizing I x ' (/) in % is equivalent to mini- 
mizing /f' n (y, / ) in Rmi+..-Hnn. Hence / A = (A(*n), A(*iml), A(*2i), -, 
f\(z nrrln )) T minimizes (3.20). Similarly, we can denote f^ ^ = (/| ! '(zn), 

/i^k^imi), /I~ l] (^2i), fx~ % \ z nm n )) T the minimizer of (3.20) with ith sub- 
ject omitted. 

3.3.3. Generalized average of submatrices, randomized estimator. To de- 
fine the GACV and randomized GACV we use the concept of generalized 
average of submatrices and its randomized estimator introduced in Gao, 
Wahba, Klein and Klein (2001) [14] for the multivariate outcomes case. Let 
A be a square matrix with submatrices An, 1 < i < n on the diagonal. 
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Denote An = (a l st ) miXmi ,l < s,t < mi. Because A^'s may have different 
dimensions, we calculate for each An 



(3.21) 

and 
(3.22) 



ran 



1 



fc=l j=l 



nrrij 



■tr{A) 



0. 



if rrij 



1 



if rrij > 1. 



VKK - !)) ELi X^t a «f 

Then the generalized average of An is defined by 

/ Si 7i ••• 7» \ 
7i ^ ••• 7; 



(3.23) 



((5, - 7i)J„ 



+ 7i • e i e j 



\ 7i 7* 



where = (1, 1..., 1) T is the unit vector of length rra,. In this case, the inverse 
of An can be easily obtained by 

1 T H T 



(3.24) 



A: 



7i 



l m,iXmi 



{Si - li){Si + (mi - l)7j) 



Now we discuss how to obtain a randomized estimator of An. Let e = 
(ef , e^) T , where ej = (e^, £i mi ) T with each generated independently 
from N(0, a 2 ). Denote e = (ei, ei, £2, Z n ) T the corresponding mean vec- 
tor with rrii replicates of Zi for each 1 < i < n, where = e *r 
Then we observe the following facts 

(3.25) £e T A = a • tr(A) 

(3.26) 



E {e T Ae-e T Ae}=a-Y J Y, a 's 

k=l s^t 

Thus, a randomized estimate of An can be obtained by replacing 5i and 7$ 
with their unbiased estimates — - — e T A and — rr- (e T Ae — e T Ae). 

3.3.4. The GACV and randomized GACV. We now present the result 
of GACV as follow. Details of the derivation can be found in Appendix B. 
Denote H the influence matrix of (3.20) with respect to / evaluated at f\. 
Write 

/ Hu * * * \ 
* H22 ■ ■ ■ * 



(3.27) 



H 
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where each Ha is a rrtj x mi submatrix matrix on the diagonal with respect 
to to (fii,...,fi mi ) T . Define = diag(6"(/ A (^i)), -, ^'(AOimJ)) the diag- 
onal matrix of estimated variances. Let W = diag(Wi, W n ) be the "big" 
variance matrix for all the observations. Denote G = I — HW with subma- 
trices Ga = I mi xmi — H^Wi, 1 < i < n on the diagonal. Now let Ha and 
Gu denote the generalized average of submatrices Ha and Ga. Then the 
generalized approximate cross validation (GACV) can be written as 
(3.28) 

y% - fr\{zii) 

GACV(A) = OBS(A) + -J2yi(d a , ...,d imt )G^H v , 



where jl\{zij) = b'{f\{zij)) denote the estimated mean response and 



(3.29) dij = w\. 



k=l 



In practice, however, computation of the influence matrix H for large 
data sets is expensive and may be unstable. Note that, in order to compute 
Ha and Ga, we only need the sum of traces and the sum of off-diagonal 
entries of Ha's and G^s. Therefore, the exact computation of H and G 
can be avoided using randomized estimates of Ha and Gu. To do this, 
we first generate a random perturbation vector e = (ej , e^) T , where 
f-i = (cji, ■■■,£im l ) T and ejj's are iid from N(0,a 2 ). Then compute the mean 
vector e = (ei, e\, e 2 , e n ) T where e, = l/^/m* e ij- Denote f x y+£ 
and f x y+S the minimizers of (3.20) with the perturbed data y + e and y + e. 
Similarly, denote f x {= fx) the minimizer with the original data. To ease 
the computational burden, we can set f x as the initial value for the EM 
algorithm of f x v+e and f\ +e - Because H is the influence matrix, we have 
that 

(3-30) « // + He, fy +l « fy + He. 



This yields 



(3.31) e T He*e T (f*+*-fA e T He » - 



Thus, a randomized estimate of Ha can be obtained as we previously de- 
scribed. Also it is straightforward to show that 

(3.32) e T Ge^e T e-e T W(fy + *-f x y), e T Ge » e T e - e T W(f/ +s - ?*) 
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which implies a randomized estimate of Ga. In order to reduce the vari- 
ance of randomized trace estimates, one may draw R independent pertur- 
bation vectors e 1 , ...,e R and compute for each e r the randomized estimates 
H^, 1 < i < n and G r u , 1 < i < n. Then the (i?-replicated) ranGACV func- 
tion is 



(3.33) 



R n 



ranGACV(A) = OBS(A)+^-^^y l (d ll ,...,d im J0[ l )- 1 ^ 



nR 

r=l t=l 



4. Covariate measurement error (model). Covariate measurement 
error is a common occurrence in many experimental settings including sur- 
veys, clinical trials and medical studies. Suppose that X\ = (xn, Xid) T 
takes values in the real space W 1 . In the presence of measurement error, 
xi is not directly observed but instead xf r = Xi + Uj is observed, where 
"Ui, 1 < i < n are iid random errors, independent of (yi,Xi). To estimate the 
regression function, our idea is to treat measurement error as a special case 
of randomized covariates. More specifically, each Xi is considered as a ran- 
dom vector distributed as xf r — When the error distribution is known, 
the distribution for x% can be obtained immediately, and therefore RC-PLE 
can be directly employed without any extra effort. 

However, in practical applications, we often face the case that the error 
distribution is unknown. One common approach in the measurement error 
literature is to assume a parametric model for the error density and to 
estimate the unknown parameters from the data. Let p(ui\9) denote the 
specified error density indexed by a real vector 9 ranging over C ~R q 
and let F(ui\9) denote the corresponding c.d.f. function. Since our goal is 
to estimate the regression function, 6 is treated as a nuisance parameter. 
Given (f,9), yi has a marginal density of 



(4.1) p(Vi\f,0)= piVilxr-tHJMuiWdui. 

Jw. d 

Thus RC-PLE can be extended by 

(4.2) lf(/,0) = -ij>g / p{ yi \xt rr -UiJ^u^dui + hu). 

In this case, we still need Assumption A.l to obtain the existence of the 
penalized likelihood estimate. In addition we state the following extra as- 
sumption which can be satisfied with most parametric models for the error 
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distribution. 



ASSUMPTION B.l. The c.d.f. function F(u\0) is continuous in 9 for any 
u G R d and the parameter space G is compact. 

Now we can show the existence of penalized likelihood estimate by the 
following Theorem which is actually a corollary to Theorem 2.2. 

THEOREM 4.1. Under A.l, A.2 and B.l, there exist f\£H and 9\ G 
such that If {f x , X ) = inf/ 6 «,0 6 e if (/, 9). 

Proof See Appendix A. □ 

5. Covariate measurement error (computation). In order to com- 
pute an estimator, we extend QPLE described in Section 3.1 as follows. De- 
note (f^\0^) the parameters estimated at iteration t. Let Zj,l < j < m 

and TVj , 1 < j < m denote the quadrature rule based on the density func- 
tion p(u\6®). Note that the quadrature rules can be generated using the 
method introduced in Section 3.1. It is not hard to see that the E-step at 
iteration t + 1 is to compute the expectation of the penalized likelihood 
— - Y^i=i log p( j/i | rr — Ui, f)p{ui\9) + §</(/) with respect to the conditional 
distributions [ui\yi, xf r , 9^'], 1 < i < n. Using the quadrature rule, 
each [ui\yi,xf" r , f^ l \9^} can be approximated by a discrete distribution 
with support {z^\l < j < m} and mass function P(ui = Zj) = wfj, 
where 

(5.i) w y> - 



13 E^Vwizr-*?./®)' 

Thus the E-step can be written as 

1 n m , 

q(/, = - e E«# • kgpfaNr - *f>/) - £ J (/) 

71 i=l 3=1 



(5.2) +-EE 4- • ^g^fi )- 

n i=l 3 = 1 

Then the M-step maximizes 

Q(f,0\f {t) ,0 (t) ), which can be done by sepa- 
rately maximizing a complete data penalized likelihood of / 



( 5 - 3 ) -EE w lf ■ l °zp(yi\ x ? r - /) - 2 J(/) 

■ =1 3=1 
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and a complete data log likelihood of 9 



1 



n m 



(5.4) 



n 



^^^•logp(^V). 



i=i j=i 



Therefore the M-step becomes a standard problem which can be solved by 
much existing software. When the EM algorithm converges, we will obtain 
the QPLE estimate (f\,9\). 

Finally, we show how to select the smoothing parameter A in the case 
of covariate measurement error. Note that our goal is to construct a good 
estimator of /, and 9 is treated as a nuisance parameter. In other words, we 
only care about the goodness of fit of the f\. Therefore A can be selected in 
the same way as randomized covariate data. To do this, we first estimate the 
error distribution by p{ui\9\) and then determine each covariate distribution 
Pi according to the relation Xi = xf r —Ui. After that, the method introduced 
in Section 3.3 can employed directly for the choice of A. 

Correcting for measurement error is a broad statistical research topic. In 
the interest of space, we only discuss the situation when we have a para- 
metric model for the error distribution. It would be possible to extend our 
method to other situations of measurement error. For example, when ad- 
ditional data is available, such as a sample from the error distribution or 
repeated observations for some Xi, we may estimate the error distribution 
more accurately by using other approaches. Also, sometimes, the parametric 
model p(ui\9) may not be available and in this case, we may want to esti- 
mate the error distribution nonparametrically. These are interesting topics 
for future research. 

6. Missing covariate data (model). Now we describe penalized like- 
lihood regression with missing covariate data. We assume the missing mech- 
anism to be missing at random. 

6.1. Notations and model. Let Xj = (xn, Xid) denote the vector of 
covariates ranging over a subspace of M. d . By the idea of Ibrahim's method 
of weights (Ibrahim, 1990[18] and Ibrahim, Lipsitz and Chen, 1999[19]), we 
first assume a parametric model for the marginal density of Xi, denoted as 
p(xi\9) > 0, where 9 £ C is a real vector of indexing parameters. Here 
9 is treated as a nuisance parameter. 

Write X{ = (x° bs , x™ 4S ) where x° bs is a vector of observed components and 
x™ ls is a di x 1 vector of missing components. Following Little and Rubin, 
(2002) [29], the likelihood of (/, 9) can be obtained by integrating or summing 
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out the missing components in the joint density for (yi,Xi) 

n . 

(6.1) L(/,0) = y>g / p(yi\x i ,f) P (x i \e)da-r' 

where J Rdi p(yi\xi, f)p{xi\6)dx r [ lls = p(y%\xi, f)p{xi\9) if x\ is completely ob- 
served. Then (/, 6) can be estimated by minimizing the following missing 
data penalized likelihood: 

(6.2) lf(/,0) = -- f>g / P (y t \x u f)p(x t \8)dxr s + ^J(f)- 

We note that this method can be viewed as an extension of RC-PLE. De- 
fine Pi m i s the probability measure over M. d * , with respect to the conditional 
density of [ x ™ s |x° fcs ] 

(0.3) pixTW) = ] jl^^, 4* e *. 

Note that (6.3) is well-defined since L di p(xi\6)dxf lls < oo from the Fubini's 
Theorem. Let 



(6.4) 5 x?b s(A) 



1 if xf s e A 
if xf s <£ A 



denote the dirac measure defined for x° bs . Consider the product measure 
= ^x oba x Pfmis which satisfies that for any Borel sets A\ C IR^ - ^, 
A2 C M. di and their Cartesian product A\ x A2, we have 

(6.5) Pi (M x A 2 ) = 5 xts {A x ) ■ Pl mis {A 2 ). 

Then it is not hard to see that 
(6-6) 

ix(f,o) = --X>g / p(vi\ x iJ) dp i+^ J (f)--ib lo % I p&\o) dx r is 

jRd 2 J R di 

is composed of a randomized covariate penalized likelihood of / and a log- 
likelihood of 6. Hence missing covariate data can be treated as a special case 
of randomized covariate data, allowing covariate distributions to be flexible. 
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6.2. Existence of the estimator. The following assumptions can be easily 
satisfied in the most experimental settings. 

ASSUMPTION M.l. V\ = {x™ is £ R d > : p(xi\6) > 0} is compact for all 
1 < % < n and 9 G 6. 

ASSUMPTION M.2. The density function p(x\9) is continuous in 6 for 
any x £ M. d and the parameter space O is compact. 

The existence of the penalized likelihood estimate can be guaranteed by 
the following Theorem which is actually a corollary to Theorem 2.2. 

THEOREM 6.1. Under A.l, A. 2, M.l and M.2, there exist fx^U and 
6 x ee such that Iff (fx, Ox) = mf fen , ee@ lx(f,0). 

Proof See Appendix A. □ 

7. Missing covariate data (computation). In order to compute an 
estimator, we can extend QPLE in the same way as covariate measurement 
error. Denote (f^,6^) the parameters estimated at iteration t. Let zf^, 1 < 

j < nii and iTjj , 1 < j < mi denote the quadrature rule based on the 

probability measure Pf ' defined in (6.5). Then the E-step at iteration t + 1 
can be written as 

-, n rrii . 

QUAf (t \e®) = ^EE4 } •i°gp(^S ) >/) - ^J(f) 

i=l j=l 

1 n mi 

(7.i) +-EE4- •iog^gV) 

i=l j=l 

where 

Then the M-step can be done by separately maximizing 

( 7 - 3 ) -EE4 } -log^i^'/) - o J (/) 

Jl . - . 1 ^ 

4 = 1 J = l 
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and 

-, n rrii 

(7.4) -££«$-i°gp(*gV) 

t=l i=l 

which is computationally straightforward assuming the log-concavity of p(x\9) 
as a function of 9. Again, when the EM algorithm converges, the QPLE es- 
timate (f\,9\) can be obtained. 

In order to select the smoothing parameter, we note that 9 is a nuisance 
parameter and the choice of A only depends on the goodness of fit of f\. 
Therefore, we may select A in the same way as randomized covariate data. 
This is straightforward, since we can take P? x defined in (6.5) as the covari- 
ate distribution. After that the method in Section 3.3 can employed directly. 

Following Ibrahim, Lipsitz and Chen (1999) [19], our method can also 
be extended to the non-ignorable missing data mechanism. In this case, 
we may specify a parametric model for the missing data mechanism and 
incorporate it into the penalized likelihood. The extension is similar but 
more complicated. Thus this is another topic for future research. 

8. Numerical Studies. In this section, we illustrate our method by 
several simulated examples with covariate measurement error and missing 
covariates. For each simulated data set, we will compare: (a) QPLE; (b) full 
data analysis before measurement error or missing covariates; and (c) naive 
estimator that ignores measurement error or leaves out the observations with 
missing covariates. Note that the choice of the smoothing parameter has 
strong effect on the penalized likelihood estimator. Hence in order to show 
the potential gain of our method, for each data set, A is selected by both 
ranGACV and the optimal value that minimizes the Theoretical Kullback- 
Leibler distance (TKL), which does not depend on the nuisance parameter 
9. 

(8.1) TKL = iV%. p (log PMhllO-] 

where /* is the true regression function, / denotes its estimator and Xi 
denotes the true covariate vector before measurement error or 'missing'. 
Note that tuning by minimizing TKL is only available in a simulation study 
when the "truth" is known. 

Our numerical studies focus on Poisson distribution and Bernoulli distri- 
bution which are also the cases in our real data set. The goal is to illustrate: 

• the gain of QPLE; 



22 



MA, DAI, KLEIN, KLEIN, LEE AND WAHBA 



• the performance of ranGACV; 

• the robustness of QPLE to the choice of quadrature rules. 

All the simulations are conducted using R-2.9.1 installed in Red Hat Enter- 
prise Linux 5. 

8.1. Examples of measurement error. Cubic spline regression is perhaps 
the most popular case of penalized likelihood regression. We consider the 
following examples from Binomial and Poisson distributions: 

(i) p(y\x) = ( y ) P(aO w (l " P(x)?- y , V = 0, 1, 2, where 

p(x) = 0.63x cos(2-7tx) + 0.36; 

(ii) p(y\x) = A(x) y e~ A ( x ^/y\, y = 0, 1, 2..., where 

A(x) = 16e- 18 (*-°- 4 ) 2 - Se- 7 ^- ' 5 ) 2 + 5; 

(iii) Same distribution as (ii) except 

A(x) = 10 6 (x n (l - x) 6 ) + 10 4 (x 3 (l - x) 10 ) + 2 

which is a modification of Example 5.5 of Gu (2002) [17]. 

In each case, we take X ~ U[0, 1] and generate a sample of n = 101 (x,y) 
pairs. For each sample generated, measurement errors are created with the 
following scheme. We first randomly select five (x, y) pairs as complete ob- 
servations and then in the rest of the 96 pairs, random errors are generated 
by Xj + m, where Uj's are iid either N(0, a 2 ) or U[— S, 5] for various values of 
the noise-to-signal ratio var(u)/var(X). For each generated data set, QPLE 
is conducted using either the Gaussian quadrature rule or the grid quadra- 
ture rule, where the Gaussian quadrature rule is computed by the statmod 
package in R-2.9.1. Note that we generate the same number of nodes for 
each noisy Xj. Simulation results are summarized by the following figures. 

Figure 1 shows the estimated curves from one simulated data set of case 
(i) with normal error and var(w)/var(A) = 0.25. QPLE is computed via 
Gaussian quadrature where 11 nodes are created for each noisy Xj. Panel (c) 
plots for each regression method the box plot of theoretical Kullback-Leibler 
distances (8.1) calculated from 100 repeated simulations. We also report in 
(d) the TKL distances calculated in the same simulation setting except that 
u is uniform (with the same noise-to-signal ratio). 
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(a) Normal errors, TKL Tuning (b) Normal errors, ranGACV Tuning 




(c) TKL/ranGACV, normal errors 

□ Tuned by TKL 

□ Tuned by ranGACV ° 



Full QPLE Naive Full QPLE Naive 



(d) TKL/ranGACV, uniform errors 

□ Tuned bycTKL ° 

□ Tuned by ranGACW 



Full QPLE Naive Full QPLE Naive 



Fig 1. Estimated curves and TKL distances for case (i). Panels (a) and (b) compare the 
target (True) curve, and three estimated curves obtained from the full data analysis (Full), 
the QPLE estimate, and the Naive estimate, (a) Tuning: TKL, (b) Tuning: ranGACV. In 
(a) and (b) u ~ N(0, 0.145 2 ), assumed known. Panels (c) and (d) provide plots of TKL 
distances, (c) u ~ N(Q, 0.145 2 ), assumed known, (d) u ~ U[— 0.25, 0.25], assumed known. 
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Remark 1 Throughout Section 8, the choice of the curves to display from 
the various 100 simulations is primarily subjective but deemed to be typical 
of the bulk of the visual images of the comparisons between the estimates. 
An idea of the scatter in the TKL distances over the 100 simulations may 
be seen in the box plots. 

Figure 2 shows the estimated curves from one simulation for case (ii) with 
uniform error and var(u)/var(A) = 0.3. We assume that 5 is unknown when 
QPLE is conducted. At each EM iteration, we use Gaussian quadrature and 
create 9 nodes for each noisy ccj. Panel (c) shows the TKL distances from 
100 simulations. Panel (d) is obtained in the same simulation setting except 
that u is normal (with the same noise-to-signal ratio), a is unknown. 

Our results indicate the significant gain of QPLE, when the smoothing pa- 
rameter is selected by either TKL or ranGACV. As we previously discussed, 
QPLE incorporates the information about the error distribution and hence 
is more informative. Generally speaking, when measurement errors are ig- 
nored, the estimated curve of naive method tends to be oversmoothed and 
more biased near the modes and boundaries. Similar phenomenon has been 
noted for other nonparametric regression methods, for example, Local poly- 
nomial estimate, as in Delaigle, Fan and Carroll (2009) [11]. For the choice of 
smoothing parameter, the proposed ranGACV inherits the property of tra- 
ditional ranGACV. As simulations suggest, it is capable of picking A close 
to its optimal value even when 6 is estimated. 

We summarize the influence of quadrature rules on QPLE at Figure 3, 
using case (hi) with normal error and var(it)/var(X) = 0.25. In the computa- 
tion, var(u) is assumed to be unknown and A is selected by TKL. We consider 
four QPLE estimators (QPLE1, QPLE2, QPLE3 and QPLE4) computed 
via, respectively, Gaussian quadrature, grid quadrature, Gaussian quadra- 
ture when u is wrongly assumed to be uniform and grid quadrature when u 
is wrongly assumed to be uniform. We first compare these quadrature rules 
by setting the number of nodes (for each noisy Xi) to be 11. The top two 
panels show the estimated curves from one simulation and panel (c) reports 
the TKL distances calculated from 100 simulations. Then we study the influ- 
ence of the number of the nodes. On panel (d), we plot for each quadrature 
the mean TKL distance (based on 100 simulations) versus the number of 
nodes. From the simulation results, we observed no significant difference be- 
tween Gaussian quadrature and grid quadrature, though, as we expected, 
Gaussian quadrature is more efficient. Surprisingly, even with a wrong er- 
ror distribution prespecified, the potential gain of QPLE is still significant. 
Hence we may say that QPLE is robust to the choice of the quadrature. We 
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(a) Uniform errors, TKL Tuning 




(b) Uniform errors, ranGACV Tuning 
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(c) TKL/ranGACV, uniform errors 



(d) TKL/ranGACV, normal errors 



□ Tuned by TKL 
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□ Tuned by TKL 
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Fig 2. Estimated curves and TKL distances for case (ii). Panels (a) and (b) compare the 
target (True) curve, and three estimated curves obtained from the full data analysis (Full), 
the QPLE estimate, and the Naive estimate, (a) Tuning: TKL, (b) Tuning: ranGACV. 
In (a) and (b) u ~ U[— 0.273, 0.273], 5 = 0.273 assumed unknown. Panels (c) and (d) 
provide plots of TKL distances, (c) u ~ U[— 0.273, 0.273], 8 — 0.273 assumed unknown, 
(d) u ~ N(0, 0.158 2 ), a = 0.158 assumed unknown. 
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also note that QPLE does not require a large number of quadrature nodes 
to compute a good estimator. There is not much gain to create more nodes if 
we already have enough. Hence, in our numerical experiments, we generally 
compute 7-12 nodes for each noisy or missing component in the covariates. 



(a) Normal errors, correctly assumed normal 
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(b) Normal errors, incorrectly assumed uniform 
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(c) TKL distances by quadrature example 
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(d) TKL distances by number of nodes 
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Fig 3. Estimated curves and TKL distances for case (in), u ~ iV(0,0145 2 ), assumed 
unknown. Tuning: TKL. Panels (a) and (b) give the target curve, and estimated curves 
from Full and Naive estimate. Panel (a) compares the Gaussian quadrature (QPLE1) and 
the grid quadrature (QPLE2) when the errors are correctly assumed to be zero-mean normal 
(with unknown variance), and panel (b) compares the Gaussian quadrature (QPLE3) and 
the grid quadrature ( QPLE4 ) when the errors are incorrectly assumed to be uniform ( with 
unknown range); (a) and (b) use 11 nodes. Panel (c) plots TKL distances, using 11 nodes. 
Panel (d) plots mean TKL versus number of nodes. The dotted upper and solid lower lines 
represent the mean TKL for the naive method and the full data analysis. 



8.2. Examples of missing covariate data. In this section, we consider 
Franke's "principal test function" 



T f x ^ = £ e -((9x 1 -2) 2 + (9x 2 -2) 2 )/4 + £ e -((9x 1 +l) 2 /49+(9x 2 +l) 2 /10) 
5.2) + I e -((9zi-7) 2 + (9x 2 -3) 2 )/4 _ l e _(( 9:Cl -4) 2 + (9x 2 -7) 2 ) 

2 5 



PENALIZED LIKELIHOOD WITH RANDOMIZED COVARIATE 27 



which was used as a test function of smoothing splines in Wahba (1983) [34]. 
T(x) is shown in Figure 4. Consider the following examples 




1.0 1-0 



Fig 4. Franke's principal test function 

(i) Binomial distribution: p(y\x) = ( ^ ^jp{ x ) y {^ ~ p( x )) 5 ~ y , where 

(8.3) p ( x ) = _L ( r(x) + 0.198); 

(ii) Poisson distribution: p(y\x) = A(x) y e~^ x ' /y\, where 

(8.4) A(z) = 15T(x) + 3. 

In each case, we take X = (X\, X2) ~ U[0, 1] x [0, 1] and generate a sample of 
n = 300 observations from the distribution of (Y,X). Afterwards, a missing 
data is created in a way that if y > 3 in case (i) or y > 10 in case (ii), we 
randomly take one of the following actions with equal probability: (1) delete 
X\ only; (2) delete X2 only and (3) delete both x\ and #2- On average, we 
create 47 incomplete observations (out of 300) in case (i) and 61 incomplete 
observations in case (ii). 

We will test our method by thin plate spline regression. In order to im- 
plement QPLE, we specify for x a bivariate normal distribution _/V(/Lt, E), 
where ji = (/ii,/i2) T and E = {(Tij}2x2 (an arbitrary covariance matrix) are 
to be estimated. At each EM iteration, we construct for each incomplete Xj 
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(c) Naive (d) TKL/ranGACV 




Fig 5. Estimated functions of p(xi,X2) and TKL distances for case (i). (a) Full data 
estimate, (b) QPLE estimate, (c) Naive estimate. The X's in (a), (b) and (c) are tuned 
by ranGACV. (d) Box plots of TKL distances when tuned by TKL and by ranGACV. 
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Fig 6. Estimated functions of A(xi,X2) and TKL distances for case (ii). (a) Full data 
estimate, (b) QPLE estimate, (c) Naive estimate. The X's in (a), (b) and (c) are tuned 
by ranGACV. (d) Box plots of TKL distances when tuned by TKL and by ranGACV. 
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a Gaussian quadrature rule, where 11 nodes are created for each missing 
component. Simulation results are summarized at Figure 5 and 6. 

Figure 5 and 6 show the estimated functions where the smoothing pa- 
rameter is tuned by ranGACV. The bottom right panel reports the TKL 
distances based on 100 simulations, when A is selected by TKL and ran- 
GACV. The performance of QPLE is also impressive in the case of missing 
covariate data. Note that most incomplete observations appeared near the 
'peak' of the test function. In this case, if these incomplete observations are 
left out, we will miss the information about the peak, as indicated by the 
naive estimator. On the other hand, by incorporating most information in 
the data including the observations with paritally missing covariates, QPLE 
provides encouraging results, even though we actually specified a wrong co- 
variate distribution. 

8.3. Case study. In this section, we illustrate our method on an obser- 
vational data set that has been previously analyzed, by deleting some co- 
variates, and then comparing our method with the original analysis and the 
naive method of dropping files with missing covariates. 

The Beaver Dam Eye Study is an ongoing population-based study of 
age-related ocular disorders. Subjects were a group of 4926 people aged 
43-86 years at the start of the study who lived in Beaver Dam, WI and 
were examined at baseline, between 1988 and 1990. A description of the 
population and details of the study at baseline may be found in Klein, 
Klein, Linton and Demets (1991) [26]. Pigmentary abnormalities are one of 
the ocular disorders of interest in that study. Pigmentary abnormalities are 
an early sign of age-related macular degeneration and are defined by the 
presence of retinal depigmentation and increased retinal pigmentation. 

Lin, Wahba, Xiang, Gao, Klein and Klein (2000) [28] and Gao, Wahba, 
Klein and Klein(2001)[14] considered only the n = 2545 womem members of 
this cohort. 11.88% of them showed evidence of pigmentary abnormalities. 
They examined the association of pigmentary abnormalities with six other 
attributes at baseline, by fitting a Smoothing Spline ANOVA (SS-ANOVA) 
model. The six attributes are are listed in Table 1. 

Let p(x) be the probability that a subject with attribute vector x at 
baseline will be found to have a pigmentary abnormality in at least one eye, 
at baseline. 

The model fitted was of the form 

(8.5) f{x) = constant + fi(sys) + f2(chol) + fi2{sys, chol) 

+ d age ■ age + d bmi ■ bmi + d horm ■ h{horm) + d drin ■ h{drin). 
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Attributes 


unit 


range 


code 


systolic blood pressure 


mmHg 


71-221 


sys 


serum cholesterol 


mg/dL 


102-503 


chol 


age at baseline 


years 


43-86 


age 


body mass index 


kg/m 2 


15-64.8 


bmi 


undergoing hormone replacement therapy 


yes/no 


yes, no 


horm 


history of heavy drinking 


yes/no 


yes, no 


drm 



Table 1 

Covariates for Pigmentary Abnormalities 



Here x denotes the vector of covariates listed in Table 1 and f(x) is the logit 
form of the probability: f(x) = log i3^jy • 

The data analysis is summarized in Figure 7, which is adapted from Lin, 
Wahba, Xiang, Gao, Klein and Klein (2000) [28] . On each panel, we plot the 
estimated probability of pigmentary abnormalities as a function of chol, for 
various values of sys, age and horm. Note that we only plot for bmi = 27.5 
and drin = no, because bmi has relatively small effect in the fitted model 
while only 152 out of 2585 subjects have drin = 1. Hence Figure 7 is ade- 
quate to demonstrate the estimated association patterns. 

Generally speaking, higher chol was associated with a protective effect. 
However, when chol goes from 250 to 350, a "bump" appears on the es- 
timated curves. This phenomenon provides us a good opportunity to test 
our method. In order to 'hide' the bump, we create a data set with missing 
covariates by deleting some attribute values for those subjects whose choles- 
terol is between 250 and 350. Consequently, 517 incomplete subjects are 
created with values of sys, bmi and horm randomly removed. More exactly, 
30 subjects missed sys, bmi and horm, 109 subjects missed both sys and 
bmi, 118 subjects missed both sys and horm and 260 subjects missed only 
one attribute value. 

We shall first claim that the methodology in this paper can be extended 
to SS-ANOVA models without any extra effort, as illustrated in Appendix 
C. In this case, QPLE can be conducted following Ibrahim, Lipsitz and Chen 
(1999) [19]. We first model the joint covariate distribution via a sequence of 
one-dimensional conditional distributions. Note that (age, chol, drin) are 
always observed and hence we do not need to model them. Also, very few 
subjects have drin = 1, hence drin will be ignored in the modeling. Given 
(age, chol), we adopt a bivariate normal distribution (sys, bmi) ~ N(n,Y,), 
where fi = (fj,±, ^2) with ^ = Ofco + o-ki s V s + a^bmi, k = 1,2 and £ = 
{o~ij}2x2 is an arbitrary covariance matrix, and the a's and S are to be 
estimated. Now conditionally on other attributes, horm is modeled via a 
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age =52, horm=no 



age =62, horm=no 



age =71, horm=no 




100 150 200 250 300 350 400 
cholesterol(mg/dL) 
age =52, horm=yes 



150 200 250 300 350 400 
cholesterol(mg/dl_) 
age =62, horm=yes 



100 150 200 250 300 350 400 
cholesterol(mg/dL) 
age =71 , horm=yes 




n 1 1 1 1 1 r 

100 150 200 250 300 350 400 
cholesterol(mg/dL) 



~l 1 1 1 1 T 

100 150 200 250 300 350 400 
cholesterol(mg/dl_) 



i 1 1 r 

100 150 200 250 300 350 400 
cholesterol(mg/dl_) 



Fig 7. Probability curves estimated from the full data analysis. This figure is adapted from 
Figures 9 and 10 from Lin, Wahba, Xiang, Gao, Klein and Klein (2000) [28]. Each panel 
plots the estimated probability of pigmentary abnormalities as a function of cholesterol, 
for four different values of sys. The six panels correspond to different values of age and 
horm, when drin =no and bmi=27.5 are fixed. 
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logistic regression model 

exp{a 30 + a 31 age + a 32 chol + a 33 sys + a u bmi} 

pinorm = 1) = - F ; — ; ; — T . 

1 + exp{a 30 + a 3 \age + a 32 c/io£ + a 33 sys + a 3i bmi) 

Following this construction of covariate distributions and using the method 
described in Section 3.1, a quadrature rule can be obtained recursively at 
each EM iteration. In the computation, the numbers of nodes generated for 
sys, bmi and horm are 10, 10 and 2 respectively. Results of QPLE are given 
at Figure 8. Figure 9 shows the naive estimator computed over the 2068 
subjects without missing covariates. 



age =52, horm=no 



age =62, horm=no 



age =71, horm=no 



X! CM 



sys 1 57 




--- sys 137 




sys 123 




sys 108 







-Q C\J 



2, o 



n r 

100 150 200 250 300 350 400 
cholesterol(mg/dl_) 
age =52, horm=yes 




n r 

150 200 250 300 350 
cholesterol(mg/dL) 



400 




~i 1 1 1 r 

100 150 200 250 300 350 400 
cholesterol(mg/dL) 
age =62, horm=yes 



1 r 

100 150 200 250 300 350 400 
cholesterol(mg/dL) 
age =71, horm=yes 




150 200 250 300 350 400 
cholesterol(mg/dL) 



i 1 1 r 

100 150 200 250 300 350 400 
cholesterol(mg/dL) 



Fig 8. Probability curves obtained from QPLE. Each panel plots the estimated probability of 
pigmentary abnormalities as a function of cholesterol, for four different values of sys. The 
six panels correspond to different values of age and horm, when drin=no and bmi=27.5 
are fixed. 



Note that only the incomplete subjects contain information about the 
bumps. Consequently, the naive estimator omitted these bumps, leading to 
monotone decreasing probability curves. In words, high cholesterol appears 
to generally lower the risk of pigmentary abnormalities especially in the 
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older, horm = no group, aside from the "bump", from the full data analysis 
shown at Figure 7. However, the naive estimator appears to make this risk 
decrease substantially more rapidly due to missing the "bump" completely, 
while the QPLE did an excellent job of recovering the original analysis- the 
QPLE estimated curves are very close to those of the full data analysis. 
This can be understood from the fact that most of the incomplete subjects 
missed only one or two (out of six) covariates. Hence most information is 
still retained in the missing data. 



-Q CM 



age =52, horm=no 



age =62, horm=no 



age =71, horm=no 



sys 157 
sys 137 
sys 123 
sys 108 



n cm 
So 




100 150 200 250 300 350 400 
cholesterol(mg/dl_) 
age =52, horm=yes 



100 150 200 250 300 350 400 
cholesterol(mg/dL) 
age =62, horm=yes 



100 150 200 250 300 350 400 
cholesterol(mg/dl_) 
age =71 , horm=yes 




100 150 200 250 300 350 400 
cholesterol(mg/dl_) 



100 150 200 250 300 350 400 
cholesterol(mg/dL) 



100 150 200 250 300 350 400 
cholesterol(mg/dl_) 



Fig 9. Probability curves obtained from the naive method. Each panel plots the estimated 
probability of pigmentary abnormalities as a function of cholesterol, for four different val- 
ues of sys. The six panels correspond to different values of age and horm, when drin =no 
and bmi=27.5 are fixed. 



9. Concluding remarks. We have presented a direct extension of pe- 
nalized likelihood regression to the situation when the observed covariates 
are probability spaces. The regression function is estimated by minimizing 
a penalized likelihood that incorporates distributional information of the 
covariates. Numerically, we compute a finite dimensional estimator after 
approximating the integrals in the likelihood function by quadrature rules. 
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Using the same approximation, GACV and its randomized version have been 
derived to select the smoothing parameters. Our method is computationally 
efficient, as it only require a small number of quadrature nodes to obtain 
a good estimate. A direct implementation of our method is to handle in- 
complete covariate data such as covariate measurement error and partially 
missing covariates. In the examples we have investigated, the resulting es- 
timator substantially outperformed the naive estimator and appeared to be 
close to the full data analysis. 

Our methods can also be extend to other regularization settings, for ex- 
ample, the LASSO and support vector machine with hinge loss function 
and L2 penalty. In these cases, it might be more complicated to develop 
a likelihood-based frequentist approach. We would like to investigate these 
extensions in the future research. 

APPENDIX A: TECHNICAL PROOFS 

Proof of Proposition 2.1. Any linear combination of measurable func- 
tions is still measurable. Therefore it suffices to prove that %b is complete. 
Let /i,/2,--. be a Cauchy sequence in Hb and /* be its limit in %. Then 
/i; /2> •■• converge pointwise to /*. Note that the pointwise limit of measur- 
able functions is still a measurable function. Therefore /* £Hb- D 

Now to simply the notation in the proofs of Lemma 2.4-2.6, let's define 
(A.l) l i {t) = y l -t-b{t) + c{y i ) 

the log-density as a function of the natural parameter. Then is strictly 
concave and bounded from above. Therefore there are three possible cases 
of the limit of li(t): 

(A. 2) (1) lim li(t)=li and lim Ut) = — 00; 

t— 00 t— >+oo 

(A.3) (2) lim k(t) = -00 and lim k(t) =~k; 

t— >— 00 t— >+oo 

(A. 4) (3) lim li(t) = —00 and lim li(t) = —00 

t— 00 t— >+oo 

where li = sup t k(t) < 00. 



Proof of Lemma 2.4. Without loss of generality, we suppose that A.l is 
satisfied with the first m cases (hence they are completely observed) . In order 
to show Lemma 2.4, we first prove that under A.l, L(f) = YliLi ^°&P{yi\ x ii f) 
is positively coercive over T~Lo- Suppose to the contrary that this is not true. 
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Then there exists a constant U > and a sequence {gfcj/ceN f= %o with 
llfffcllw = 1 such that 

in 

(A.5) -^2k(k-g k (xi)) < U, fceN. 

i=l 

Since the unit sphere {5 € Hq : \\g\\u = 1} is sequence compact, there 
exists a subsequence {gkj}jeN converging to some g* with = 1- We 

claim that 

{< 0, if i belongs to Case 1 as (A. 2) 
> 0, if i belongs to Case 2 as (A. 3) 
= 0, if i belongs to Case 3 as (A. 4). 

Suppose to the contrary that (A. 6) is not true. If i belong to case (1), then 
g*(xi) = a > 0. Since {<7fc -}jeN converges to g* , there exists N > such that 

(A.7) g kj ( Xl )>a/2, for all j > N. 

From (A.5), we have 

(A.8) h(kj ■ g kj {xi)) > U - ^2~l s < 00, j € N. 

This is a contradiction of (A. 2) since when j > N 
(A.9) kj ■ g k] {xi) > kj ■ a/2 -> +00. 

Similar contradiction can be observed when i belongs to case (2) or case (3). 
Therefore the claim in Equation (A. 6) follows. 

Now let go be the unique minimizer of — Y^I=x h(g( x i)) in Ho- Consider 
9o + r S* with r > 0. Combining (A.2)-(A.4) and (A. 6), we can see that 

m m 

(A.10) -J2k(jgo(xi)+rg m (xi))<-J2lii9o(.Xi)), Vr > 0. 

i=l i=l 

But this is a contradiction. Hence L(f) is positively coercive over T~Lq, which 
means that 

m 

(A.ll) \\g\\ n ->• 00 => -^^(g(jj)) ->• +00, g£U . 

i=i 

Consider the orthogonal decomposition f = g + h where <? G Ho H 
and /i S Hi P| Hg. The Lemma can be proved in steps. 
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(i) \\h\\u — > +00. In this case 

1 n 1 

(A.12) !*(/) > + ^\\h\\n -> +00. 

i=i 

(ii) < i7 for some U > but — >■ +00. In this case 

\h( Xi )\ = \{h,K(; Xi ))\ < \\h\\nK l / 2 (x h Xi) <U-K l l 2 {x h Xi), i = 1,2, ...m 
which implies that 

f(xi) = g(xi) + h(xi) = g(xi) + 0(1), i = l,...,m, \\h\\ n < U. 
Let \\g\\u — > 00, we have 

n 

/£(/) > — Vlog / p^X^ftdPi 

^ m 1 n 

> — y\k{g(xi) + h{xi)) — y~] ij 

i=l j=m+l 

= ~Ew*i)+o(i))-^ £ 7 i 

i=l j'=m+l 

(A.13) +00 



where (A.13) follows from the claim in Equation (A. 11). 
The Lemma is now proved by combining (i) and (ii). □ 

Proof of Lemma 2.5. Let {fk}keN be a sequence in T~Lb which con- 
verges weakly to /*. Since pointwise limit of measurable functions is still a 
measurable function, /* £ He- From the continuity of k(t), {e l ^ k<yXi ^} km 
pointwise converges to e li v*( Xi ^ over Xi. Note that e li ^ k<(X ^ < e li and ev- 
ery constant is integrable with respect to (Afj, J-i, Pj). By the Dominated 
Convergence Theorem, we have that 

(A.14) lim / e hifk(xi)) dPi = [ e kirixi)) dP t . 

fc^oo J x . J x . 

The Lemma now follows since log(-) is continuous. □ 



Proof of Lemma 2.6. Let {/fcjfcgN be a sequence in 7~Lb which weakly 
converges to /*. Consider the orthogonal decomposition of each fk by fk = 
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9k + hk with g k G Hq (~)Hb and h k G "Hi D^s- ^ is straightforward to see 
that {/ifcjfceN weakly converges to h* , the smooth part of /*. Therefore we 
can write 

(A.15) 0<\\h k -h*\\ 2 H = \\h k \\ 2 H + \\h*\\ 2 H -2(h k ,h*). 

Let k — > oo, we observe that 

(A.16) < liminf \\h k \\n - \\h*\\^ 

k 

and the Lemma is proved by definition. □ 



Proof of Theorem 4.1. For any fixed 9 G 9, by Theorem 2.2, if (f,9) 
is minimizable in T~L. Let 

(A.17) r(0)4min/f(/,0) 

denote the minimum penalized likelihood given 9. We claim that T{9) is 
continuous. 

For any sequence {#fc}fceN £ that converges to 9*, let Pg fe and Pg* 
denote the probability measures on W 1 with density functions p(u\9 k ) and 
p(u\9*). Since F(u\9 k ) — > F(u\9*) for any u G M. d , Pg k weakly converges to 
Pg* . Note that, for any fixed / G T~L, G(u) = p{yi\xf r — u, /) is a real-valued, 
continuous and bounded function on R d . Thus f G(u)dPg k — > f G{u)dPg*. 
Equivalently, that is 

(A.18) / p(yi\xf r -Ui,f)p(ui\9 k )dui-t / p(yi\xf r - Ui, f)p(ui\9*)dui 

which implies that I^(f,9) is continuous in 9 for any fixed /. This is suf- 
ficient to prove the continuity of T(9). The theorem now follows from the 
compactness of 0. □. 

Proof of Theorem 6.1. For any fixed 9 G 0, by (6.6) and Theorem 2.2, 
I\f(f, 9) is minimizable in T~L. Thus, we can define 

(A.19) T(0)^minlf(/,0). 

We claim that T(9) is continuous. 

By Assumption M.l and M.2, there exists U > such that p(xi\9) < U for 
all x™ iS G T>f , 9 G and 1 < i < n. Now for any sequence {6*fc}fceN G that 
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converges to 9*, p(yi\xi, f)p(xi\0k) pointwise converges to p(yi\xi, f)p(xi\9*). 
Note that p(yi\xi, f)p(xi\9k) < e li ■ U and any constant is integrable on the 
compact domain T>f. By Dominated Convergence Theorem, we conclude 
that 

(A.20) lim / p( yi \ Xi , f)p{ Xi \e k )dxf s = [ p{ yi \x u f)p{ Xi \e*)dx™ s 
J v e J v e 

which implies that I^f(f,9) is continuous in 6 for any fixed /. This is suf- 
ficient to prove the continuity of T(0). The theorem now follows from the 
compactness of 0. □ 



APPENDIX B: DERIVATION OF GACV 

Our GACV is derived based on the cross validation function (3.19). Let 
us use the notations (3.15) and (3.16). It can be seen from (3.18) that 
Y^jHi ij f\ % \ z ij) can be treated as a function of /J l \ Note that ^ 
is expected to be close to fxi, Thus using the first order Taylor expansion 
to expand Y^=l w \,ijf\ \ Zi i) at we have tliat 

CV(A) « OBS(A) + iVy l (/; i 

n ^-^ 

i=l 

(B.l) = OBS(A) + iVy 4 (/; i 

n 

i=l 

where Wij(r) and dij are defined by (3.14) and (3.29), respectively. Thus, it 

remains to estimate f\i — f Xi • To do this, we first extend the leave-out-one 
lemma (Craven and Wahba,1979[10]) to randomized covariate data. 

LEMMA B.l (leave-out-one-subject lemma) Let l(yi,t) = y% • t — b{t) + c(y) 
be the log-likelihood function and I^' n (y, f) = — Yli=l Sj=i n ij ex P{^(?/j) 
/(%))} + it-J(/)j where y = (y/ ...,y n T ) T with y { T = (y;,...,yi) T being 
mi replicates of yi. Suppose that r = (n, T mi ) T is a mi x 1 vector and 
h\(i, r, •) is the minimizer in H of T A ' (Y, f), where Y = (y± T , y^li, t t , y^ 
...,yJ) T . Then 

(B.2) h x (i,fl^\-) = f [ - i] 



hi ) dT 



f. 



-<KT 
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where minimizes - Y^k^i lo S Yj=i ^kj exp{Z(yi, /(%))} + ^J(f), and 
P\i^ = (^'(/I —i b'{f\ % \ z im.i))) T is the vector of means correspond- 

ing to f [ ~ l] . 

Proof of Lemma B.l. Firstly, we claim that 
(B.3) 

l(b'(f [ - l \ Zij )), /l _t] (%)) > l(b'(ft l] ( Zij )), f( Zij )), 1 < j < m u V/ e 
This follows since 

and using the fact that = -b"(t) < 0. Therefore i(6 , (/i~ ,1 («ij)),t) 

achieves its unique maximum for i = /! (%■)■ 

Define y = (y/, ...,y i 'E 1 , {i^) T , %+i, ...,y n T ) T . Then for any /£?/, 

lf n (y H] ,/) = -log^7r 4 ,exp{/(fe'(/;" i ](^)),/(^))} 



j'=l 



nA 



— log TTfcj exp{Z(2/ fc , /(^fcj))} + ~Y J U) 

k+i j=l 
rrn 

> - log £ ^ exp{Z(6'(/[- 

j'=i 

— log TTfcj exp{Z(?/ fc , + ~Y J U) 

k^i j=l 
rrn 

> - log ^ 7Tjj exp{/(6'(/[- 4] (z 4J )), 

3=1 

-^log^7r fej exp{Z(y fc ,^(^))} + ^-J(^ ] ). 

The first inequality is due to (B.3) and the second one is due to the fact 
that f [ ~ l] minimizes - Yk^i lo S 12f=i ^kj exp{/(y i5 f(z k j))} + ^ •/(/)• Thus 
we have h\{i,j£ x ~ , •) = □ 

Consider the parametric form of the penalized likelihood in (3.20) and 
denote y = (y 1 T , y^, (f^T, yJ) T ■ Then Lemma B.l says 



PENALIZED LIKELIHOOD WITH RANDOMIZED COVARIATE 41 



that/ A [ lJ = (f[ l \z n ),-,fx (zimi),fx ( z 2i), ••-,/! (z n m„)) T minimizes 

I\' U {y [ ~ i] J )• Note that fx = (fx{zn), fx(zi m i), AO21), f\{z nmn )) 
z 

A 



minimizes I Z,U (y,f ). Thus, 



(B.4) ^V(y,/l) = o ) ^ r (y [ " l \fT ] ) = o. 

d f d f 

Using first order Taylor expansion, we have that 

= ^(^-if) 

df 



2 T z,n 



of of of 1 oy of 1 

GT, /1)(/a M - /a) + tt^fGT, )G7 H1 - y) 



df df T A ay a/ 4, 

(B.5) 

where (y*, is a point between (y, fx) and (y /J 

Consider any arbitrary vector / = (fi,-,f%) with /j = (/ a , fi m J T 
being an rrii x 1 vector. For 1 < z < n and 1 < s,t < m>i, let's denote 



-w is (f) [l + (1 - m s (f ))fis(yi - b'(f is ))\ , if s = < 
Wis{f)wit{f)fis{yi - b'(fit)), if s 7^ t 



Wis(f) b"(f is ) - (1 - WisifMyi - b'(f is )) 



, if s = t 



W is (f)w it (f )( yi - b'(f is ))( yi - b'(fu)), its^t. 



Kt(f ) 

4(/) = 



Define submatrices Bi(f ) = [b\ t (f )) and A(/ ) = [d\ t {f ) 

and let = diag(Si(/), ...,S n (/)) and£>(/) = diag(Z>i(/), ...,£>„(/)) 
be block diagonal matrices. Then direct calculation yields 

a 2 / Z ' n 1 2 / Z ' n 1 

Therefore, from (B.5), we have 

(B.7) fx - f x [ - l] = -(D(fx) + n^BUDiy-y HI). 



Approximate B(f^) and D(f x *) by B(fx) and D(fx). Then denote 



(D(fx) + nT>x)~ B(fx) the influence matrix of I A ' (y,f) with respect to 
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/ evaluated at A- From (B.7), we have 



(B. 



Write 



(B.9) 



fxi f\i 



\ fxn f\ n / 



I 



H 







\ 



Vi ~ 



H 



( Hn * * * \ 
* H22 ■ ■ ■ * 



V 



where each Ha is a m; x mi submatrix matrix on the diagonal with respect 
to (fn, fim i ) T . We observe from (B.8) that 



(B.10) 



f^-fW^Huim-flx?)- 



Recall that ^ = {b\f^ i \z il )), ...,b' (f[~ l \z tmi ))) T is a vector of &'(•) 
evaluated at f Xi . Hence, using a first order Taylor expansion to expand 
&'(■) at /ai, we have 



(B.ll) 



^Xi 



m « Wi(Ai iJ - Ai) 



where Wj = diag(6"(A(^ii)) 5 b" (fx(zi mi ))) is a diagonal matrix of vari- 
ances. 

Combining (B.10) and (B.ll), we can show that 
A» ~ fxi ~ ~ /4i ) 

(B.12) « Hu(yi-Pxi + Wi(fxi-f x [ r i] )). 

Now, an approximation of Ai — /J ^ can be obtained by solving (B.12) 

(B.13) hi - fx [ r ] ~ (Wmi - HaWi)- 1 H u (yi - fl Xi ). 
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Plug (B.13) into the CV function (B.l), we obtain the approximate cross 

validation (ACV) function 

(B.14) 

, i n 

ACV(A) = OBS(A) + -V y! (4, .^d imt ){Im^ mi - H^Wi)' 1 H^ft - jt M ) 
n 

where OBS(A) is given in (3.13). Define Gu = I mi xmi ~ HaWi- Then a 
generalized form of approximate cross validation (GACV) can be obtained 
by replacing each Ha and Gu with the generalized average of submatrices 
defined in (3.23). Let Ha and Gu denote the generalized average of Ha 
and Gu. Then the generalized approximate cross validation (GACV) can be 
defined 



, _ i n 

GACV(A) = OBS(A) + -J^Vii^u ■■■^ im JG^ 1 H u (y i - fl Xi ) 

i=l 

- n mi 

(B.15) = ^log^7r ij exp|y i / A (z ii ) -&(A(%))} 

f Vi- Aa(^i) 
+ — ^ yijdg, dim^G^Hg '• 



n 

i=l 



We remark that if all the exactly observed, then the above GACV 

function will reduce to the original GACV formula in Xiang and Wahba 
(1996) [36]. 

APPENDIX C: EXTENSION TO SS-ANOVA MODEL 

Smoothing spline analysis of variance (SS-ANOVA) provides a general 
framework for multivariate nonparametric function estimation. The appli- 
cation is very broad. To extend the methodologies of the paper, it suffices 
to show that the penalized likelihood for SS-ANOVA model can be formu- 
lated in the form of (1.3). The following arguments are derived from Wahba 
(1990) [35]. 

The penalized likelihood of smoothing Spline ANOVA model takes the 
form of 

n b 

(c.i) h(f) = --Y, l °sp(y^ f) + Y, x ^ F ifW 2 ^ 

i=l 13=1 1 

where are nonparametric subspaces (smooth spaces) which are assumed 
to be RKHS with reproducing kernel K^(-,-) and projects / onto H^. 
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Now For A/? > 0, define U\ = Y%=i ® n i with 

norm 

(C2) IfoUk =J2 X ^U\ 2 H ^ri €Wi. 

0=1 1 

It can be shown that Hi is a RKHS equipped with RK £j =1 j-K^{s,t). 
Then we can write that 

1 n 

(C.3) I x (f) = --^og P (yi\xiJ) + ||Pi/||^ 

i=i 

where Pi projects f E H onto Set J(/) = HPi/H^. Then the above 
expression takes the form of (1.3). Therefore our discussion in this paper 
can be extended to SS-ANOVA model. 
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